Classifying document types to enhance search and recommendations in digital libraries
In this paper, we address the problem of classifying documents available from
the global network of (open access) repositories according to their type. We
show that the metadata provided by repositories enabling us to distinguish
research papers, theses and slides are missing in over 60% of cases. While
these metadata describing document types are useful in a variety of scenarios
ranging from research analytics to improving search and recommender (SR)
systems, this problem has not yet been sufficiently addressed in the context of
the repositories infrastructure. We have developed a new approach for
classifying document types using supervised machine learning based exclusively
on text-specific features. We achieve 0.96 F1-score using the random forest and
AdaBoost classifiers, which are the best performing models on our data. By
analysing the SR system logs of the CORE [1] digital library aggregator, we
show that users are an order of magnitude more likely to click on research
papers and theses than on slides. This suggests that using document types as a
feature for ranking/filtering SR results in digital libraries has the potential
to improve user experience.
Comment: 12 pages, 21st International Conference on Theory and Practice of
Digital Libraries (TPDL), 2017, Thessaloniki, Greece
Unsupervised learning of document image types
In a system where medical paper documents have been scanned into digital images, understanding the document types that exist in the system could support vital data indexing and retrieval. When millions of document images have been scanned, it is infeasible to expect a supervised algorithm or a tedious manual effort to discover the document types. The most sensible and practical way to do so is with an unsupervised algorithm. Many clustering techniques have been developed for unsupervised classification. Many rely on all data being presented at once, on the number of clusters being known, or both. Presented in this thesis is a two-threshold clustering scheme that relies on a hierarchical decomposition of the features. On a subset of document images, it discovers document types at an acceptable level and confidently classifies unknown document images.
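As a rough illustration of how a two-threshold scheme can avoid fixing the number of clusters in advance, the following is a minimal sketch of a generic two-threshold sequential clustering pass. It is not the scheme from the thesis, whose hierarchical feature decomposition is not described in the abstract. The function names, the Euclidean distance, the centroid representation and the thresholds `theta1` and `theta2` are all assumptions made for illustration.

```python
import math


def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length tuples."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def two_threshold_clustering(points, theta1, theta2):
    """Generic two-threshold sequential clustering (illustrative sketch only).

    A point closer than theta1 to the nearest cluster centroid joins that
    cluster; a point farther than theta2 starts a new cluster; anything in
    between is deferred to a later pass. theta1 and theta2 are hypothetical
    parameters, not values taken from the thesis.
    """
    if not theta1 < theta2:
        raise ValueError("theta1 must be smaller than theta2")

    clusters = []            # each cluster is a list of member points
    pending = list(points)   # points that still need a decision

    while pending:
        deferred = []
        progressed = False
        for point in pending:
            if not clusters:
                clusters.append([point])
                progressed = True
                continue
            # Distance to the nearest existing cluster centroid.
            dist, nearest = min(
                ((math.dist(point, centroid(c)), c) for c in clusters),
                key=lambda pair: pair[0],
            )
            if dist <= theta1:
                nearest.append(point)
                progressed = True
            elif dist >= theta2:
                clusters.append([point])
                progressed = True
            else:
                deferred.append(point)
        if not progressed:
            # No point could be decided in this pass: force the first
            # ambiguous point to seed a new cluster so the loop terminates.
            clusters.append([deferred.pop(0)])
        pending = deferred
    return clusters
```

Calling `two_threshold_clustering(points, 1.0, 4.0)` on a list of feature tuples returns a list of clusters without the number of clusters being supplied. The deferral step is what distinguishes this from a single-threshold pass: ambiguous points are decided only after the clearer ones have shaped the clusters.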
Folksonomies vs. Bag-of-Words: The Evaluation & Comparison of Different Types of Document Representations
Published or submitted for publication; peer reviewed
IVOA Recommendation: VOResource: an XML Encoding Schema for Resource Metadata Version 1.03
This document describes an XML encoding standard for IVOA Resource Metadata,
referred to as VOResource. This schema is primarily intended to support
interoperable registries used for discovering resources; however, any
application that needs to describe resources may use this schema. In this
document, we define the types and elements that make up the schema as
representations of metadata terms defined in the IVOA standard, Resource
Metadata for the Virtual Observatory [Hanisch et al. 2004]. We also describe
the general model for the schema and explain how it may be extended to add new
metadata terms and describe more specific types of resources.
What makes papers visible on social media? An analysis of various document characteristics
In this study we have investigated the relationship between different
document characteristics and the number of Mendeley readership counts, tweets,
Facebook posts, mentions in blogs and mainstream media for 1.3 million papers
published in journals covered by the Web of Science (WoS). It aims to
demonstrate how factors affecting various social media-based indicators
differ from those influencing citations and which document types are more
popular across different platforms. Our results highlight the heterogeneous
nature of altmetrics, which encompasses different types of uses and user groups
engaging with research on social media.
Comment: Presented at the 21st International Conference in Science &
Technology Indicators (STI), 13-16 September 2016, Valencia, Spain
Rewrite based Verification of XML Updates
We consider problems of access control for update of XML documents. In the
context of XML programming, types can be viewed as hedge automata, and static
type checking amounts to verifying that a program always converts valid source
documents into valid output documents. Given a set of update operations, we
are particularly interested in checking safety properties such as preservation
of document types along any sequence of updates. We are also interested in the
related policy consistency problem, that is, detecting whether a sequence of
authorized operations can simulate a forbidden one. We reduce these questions
to type checking problems, solved by computing variants of hedge automata
characterizing the set of ancestors and descendants of the initial document
type for the closure of parameterized rewrite rules.
A granular approach to web search result presentation
In this paper we propose and evaluate interfaces for presenting the results of web searches. Sentences, taken from the top retrieved documents, are used as fine-grained representations of document content and, when combined in a ranked list, provide a query-specific overview of the set of retrieved documents. Current search engine interfaces assume users examine such results document-by-document. In contrast, our approach groups, ranks and presents the contents of the top ranked document set. We evaluate our hypotheses that the use of such an approach can lead to more effective web searching and to increased user satisfaction. Our evaluation, with real users and different types of information-seeking scenarios, showed, with statistical significance, that these hypotheses hold.